Posted: 2008-04-07 12:46:42

Text Retrieval and Mining
# 11-1

Information Extraction

Lecture by Young Hwan CHO, Ph. D.

Youngcho@gmail.com

Page 2

Plan for Today

Information Extraction

Introduction to the IE problem
Wrappers
Wrapper Induction
Traditional NLP-based IE
Pattern Learning Systems: Rapier
Probabilistic sequence models: HMMs

Page 3

What is Information Extraction?

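The extraction or pulling out of pertinent information from large volumes of text.
Rather than telling the user which documents to read, IE extracts the pieces of information the user needs and keeps links between the extracted information and the original documents, so that the user can refer back to the source.
This information must be highly reliable and detailed; current technology performs at roughly the levels shown below.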

Items of Information, Definitions, and Percentile Reliability (reliability figures not shown):

Entities: an object of interest such as a person or organization
Attributes: a property of an entity such as its name, alias, descriptor, or type
Facts: a relationship held between two or more entities
Events: an activity or occurrence of interest such as a terrorist act or an airline crash

Page 4

IE from the Web: The Big Picture

Page 5

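Components of Information Extraction

Spider: collects web pages

Collects the target web pages and finds the URL of the next page

Wrapper: HTML page -> XML DB

For CGI-style pages, a wrapper alone can be sufficient

NLP Lib: extracts information from sentences

Collects specific facts from free-style HTML, descriptive text, news, and so on
A DB holds older data than the text does

Information Cooking

Identification: determines the document style
Segmentation: splits the document into its component pieces
Classification: categorizes entities within a document, and documents themselves
Clustering: clusters entities within a document, and documents themselves
Association: maps information in the document to DB fields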

Page 6

Examples : Corpus

Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field.
Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G. headquartered in the Maddox family's hometown of La Jolla, CA.

Page 7

Examples : Entity

Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G.
Locations: La Jolla
Artifacts: Geninfo, Geninfo
Dates: June 1999

Page 8

Examples : Attributes

NAME: Fletcher Maddox, Maddox
DESCRIPTOR: former Dean of the UCSD Business School, his father, the firm's CEO
CATEGORY: PERSON

NAME: Oliver
DESCRIPTOR: His son, Chief Scientist
CATEGORY: PERSON

NAME: Ambrose
DESCRIPTOR: Oliver's brother, the CFO of L.J.G.
CATEGORY: PERSON

NAME: UCSD Business School
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: La Jolla Genomatics, L.J.G.
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: Geninfo
DESCRIPTOR: its product
CATEGORY: ARTIFACT

NAME: La Jolla
DESCRIPTOR: the Maddox family's hometown
CATEGORY: LOCATION

NAME:
DESCRIPTOR:
CATEGORY: LOCATION

Page 9

Examples : Facts

PERSON            Employee_of    ORGANIZATION
Fletcher Maddox   Employee_of    UCSD Business School
Fletcher Maddox   Employee_of    La Jolla Genomatics
Oliver            Employee_of    La Jolla Genomatics
Ambrose           Employee_of    La Jolla Genomatics

ARTIFACT          Product_of     ORGANIZATION
Geninfo           Product_of     La Jolla Genomatics

LOCATION          Location_of    ORGANIZATION
La Jolla          Location_of    La Jolla Genomatics
                  Location_of    La Jolla Genomatics

Page 10

Examples : Events

Company formation event

COMPANY: La Jolla Genomatics
PRINCIPALS: Fletcher Maddox, Oliver, Ambrose
DATE:
CAPITAL:

Product release event

COMPANY: La Jolla Genomatics
PRODUCT: Geninfo
DATE: June 1999
COST:

Page 11

Unstructured Data -> Structured/Semi-Structured Data

Task = Filling slots in a database from sub-segments of text
Techniques = Segmentation + classification + clustering + association

Page 12

Source Styles

Page 13

Segmentation

Extract metadata (e.g. author, title, date)
Identify sections (e.g. abstract)
Extract keywords

Page 14

Clustering + Classification

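Within a document

Determine the entity type of each named entity in the document

Person name, job title, organization name, date, organization, unit, address
Title, list-style sentence, descriptive sentence

When data of the same form is listed repeatedly, items listed in the same pattern as an already identified item are treated as the same field

Across multiple documents

Measure the reliability of the extracted information (importance of the document, relevance of its domain)
Cross-check information collected from multiple sources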

Page 15

Association

Page 16

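Global vs Local Extractions

Local Extraction models

Extract information from a single web site
Convert HTML documents into a structured XML style tailored to that site

Global Extraction models

Extract fielded information from the text of many web sites

Combining the two models

A local model can supply training data, or a high-accuracy initial DB, for the global model
A global model can add new data or new fields that the local model did not produce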

Page 17

Information Extraction in Practice

CGI-generated HTML pages

Generation: DB -> (CGI) -> HTML
Reverse engineering: HTML -> (Crawler) -> (Wrapper) -> DB

News, Report

Entities, attributes, facts, and events must be extracted through linguistic analysis

Page 18

Extracting Corporate Information

E.g., information need: Who is the CEO of MarketSoft?

Source web page. Color highlights indicate type of information. (e.g., red = name)

Data automatically extracted from marketsoft.com

Source: Whizbang! Labs / Andrew McCallum

Page 19

Product information

Page 20

Product information

Page 21

Canonicalization: Product information

Page 22

Wrappers

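To extract information with an agent, rules are needed that describe the location, structure, and format of the target information in each document; such rules are generally called a wrapper.
Writing a wrapper

Manual authoring: can raise extraction accuracy, but offers no remedy when the document changes
Automatic generation: generated automatically from domain knowledge and sample documents; can adapt to document changes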

Page 23

Amazon Book Description

….

</td></tr>
</table>
The Age of Spiritual Machines : When Computers Exceed Human Intelligence
by <a href="/exec/obidos/search-handle-url/index=books&field-author=Kurzweil%2C%20Ray/002-6235079-4593641">Ray Kurzweil</a>
<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0></a>
List Price: $14.95
Our Price: $11.96
You Save: $2.99 (20%)

…

Page 24

Extracted Book Template

Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96

Page 25

Wrappers: Simple Extraction Patterns

Specify an item to extract for a slot using a regular expression pattern.

Price pattern: “\$\d+(\.\d{2})?\b”

May require preceding (pre-filler) pattern to identify proper context.

Amazon list price:

Pre-filler pattern: “List Price: ”
Filler pattern: “\$\d+(\.\d{2})?\b”

May require succeeding (post-filler) pattern to identify the end of the filler.

Amazon list price:

Pre-filler pattern: “List Price: ”
Filler pattern: “.+”
Post-filler pattern: “”
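
The pre-filler / filler / post-filler idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `extract_slot`, the sample strings, and the `<span class="price">` markup are hypothetical, and the filler here is made lazy (`.+?`) so that it stops at the first post-filler rather than the last.

```python
import re


def extract_slot(text, filler, pre_filler="", post_filler=""):
    """Return the first filler found between the given context strings, or None."""
    # Pre- and post-fillers are treated as literal context, so they are escaped;
    # the filler itself is a regular expression.
    pattern = re.escape(pre_filler) + "(?P<filler>" + filler + ")" + re.escape(post_filler)
    match = re.search(pattern, text)
    return match.group("filler") if match else None


PRICE = r"\$\d+(?:\.\d{2})?\b"
page = "List Price: $14.95\nOur Price: $11.96\nYou Save: $2.99 (20%)"

# The same filler pattern yields different slots depending on the pre-filler.
list_price = extract_slot(page, PRICE, pre_filler="List Price: ")
our_price = extract_slot(page, PRICE, pre_filler="Our Price: ")

# Hypothetical markup: a generic ".+?" filler needs a post-filler to know where to stop.
marked_up = '<span class="price">$14.95</span> (paperback)'
span_price = extract_slot(marked_up, ".+?",
                          pre_filler='<span class="price">', post_filler="</span>")

print(list_price, our_price, span_price)
```

Without the pre-filler, the price pattern alone would match all three prices on the page, which is why the context strings are needed to pick the right slot.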

Page 26

Wrapper induction

Highly regular source documents
Relatively simple extraction patterns
Efficient learning algorithm

Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.
Alternative is to use machine learning:

Build training data (pairs of a document and human-written rules).
Automatically learn the specific patterns that appear around each item in the HTML documents.

Page 27

Wrapper induction: Delimiter-based extraction

Use <B>, </B>, <I>, </I> for extraction

<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

Page 28

Learning LR wrappers

labeled pages -> wrapper ⟨l₁, r₁, …, l_K, r_K⟩

Example: Find 4 strings ⟨<B>, </B>, <I>, </I>⟩ = ⟨l₁, r₁, l₂, r₂⟩

<HTML><HEAD>Some Country Codes</HEAD>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
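
Once the delimiters are known, running an LR wrapper is a single left-to-right scan: for each attribute, skip to the next left delimiter and read up to the next right delimiter. The sketch below is a minimal illustration, not the authors' implementation. The function name `exec_lr` is hypothetical, and the `<B>`/`<I>` markup in the sample page is an assumption based on the classic country-code example.

```python
def exec_lr(page, delimiters):
    """Apply an LR wrapper given as a list of (left, right) delimiter pairs,
    one pair per attribute, and return the extracted tuples in page order."""
    tuples = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:          # no further left delimiter: page is exhausted
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end == -1:            # unterminated value: stop rather than guess
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


# Assumed markup: country names in <B>...</B>, codes in <I>...</I>.
page = (
    "<HTML><TITLE>Some Country Codes</TITLE>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "</BODY></HTML>"
)

codes = exec_lr(page, [("<B>", "</B>"), ("<I>", "</I>")])
print(codes)
```

The wrapper contains no knowledge of countries or numbers; it relies entirely on the page being regular enough that the same four strings surround every value.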

Page 29

LR: Finding r₁

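<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

r₁ can be any prefix of the text that follows each country name, e.g. </B>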

Page 30

LR: Finding l₁, l₂ and r₂

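<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

r₂ can be any prefix of the text that follows each code, e.g. </I>

l₂ can be any suffix of the text that precedes each code, e.g. <I>

l₁ can be any suffix of the text that precedes each country name, e.g. <B>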
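
The prefix/suffix idea can be turned into a small generate-and-test learner. The sketch below is a simplification under assumptions, not the published algorithm. It takes candidate right delimiters from the common prefix of the text after each value, takes candidate left delimiters from the common suffix of the text before each value, and keeps the first combination that reproduces the labels. The names `learn_lr` and `exec_lr`, the sample markup, and the brute-force search over combinations are all illustrative choices.

```python
from itertools import product
from os.path import commonprefix  # character-wise common prefix of strings


def exec_lr(page, delimiters):
    """Run an LR wrapper: a list of (left, right) delimiter pairs."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


def learn_lr(page, rows):
    """Return delimiter pairs consistent with the labeled rows, or None."""
    # Locate every labeled value by scanning left to right (a simplification:
    # real labels would carry character offsets).
    spans, pos = [], 0
    for value in (v for row in rows for v in row):
        start = page.index(value, pos)
        pos = start + len(value)
        spans.append((start, pos))
    k = len(rows[0])
    candidates = []
    for attr in range(k):
        befores, afters = [], []
        for i in range(attr, len(spans), k):
            prev_end = spans[i - 1][1] if i > 0 else 0
            next_start = spans[i + 1][0] if i + 1 < len(spans) else len(page)
            befores.append(page[prev_end:spans[i][0]])
            afters.append(page[spans[i][1]:next_start])
        suffix = commonprefix([s[::-1] for s in befores])[::-1]
        prefix = commonprefix(afters)
        lefts = [suffix[-n:] for n in range(1, len(suffix) + 1)]
        rights = [prefix[:n] for n in range(1, len(prefix) + 1)]
        candidates.append(list(product(lefts, rights)))
    # Validate candidate combinations by executing them on the labeled page.
    for combo in product(*candidates):
        if exec_lr(page, list(combo)) == rows:
            return list(combo)
    return None


page = (
    "<HTML><TITLE>Some Country Codes</TITLE>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "</BODY></HTML>"
)
rows = [("Congo", "242"), ("Egypt", "20"), ("Belize", "501"), ("Spain", "34")]
learned = learn_lr(page, rows)
print(learned)
```

Because shorter candidates are tried first, the learned delimiters may be shorter than the full tags; any combination that reproduces the labeled tuples counts as a valid wrapper for this page.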

Page 31

Wrapper generator: overall flow and domain knowledge representation

Page 32

Wrapper generator: preprocessing (logical line generation)

To approximate what the browser displays, remove the HTML tags that are not visible on screen, and split the page into lines at table-related tags (e.g. TR, TH) and at list-style tags used to break lines (e.g. BR, P, LI).
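
A rough sketch of logical-line generation using only regular expressions. The function name `logical_lines`, the exact tag set, and the sample HTML are assumptions made for illustration; a production system would more likely use a real HTML parser.

```python
import re

# Tags assumed to start a new logical line (taken from the examples above).
LINE_BREAK_TAGS = ("tr", "th", "br", "p", "li")
BREAKER = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(LINE_BREAK_TAGS), re.IGNORECASE)


def logical_lines(html):
    """Split an HTML fragment into the text lines a reader would roughly see."""
    text = re.sub(r"\s+", " ", html)        # source whitespace is not significant
    text = BREAKER.sub("\n", text)          # line-breaking tags become newlines
    text = re.sub(r"<[^>]+>", " ", text)    # every other tag is dropped
    lines = [re.sub(r" +", " ", line).strip() for line in text.split("\n")]
    return [line for line in lines if line]


html = """<table>
<tr><td>Title</td><td>The Age of Spiritual Machines</td></tr>
<tr><td>Author</td><td><a href="#">Ray Kurzweil</a></td></tr>
</table>
<p>List Price: <b>$14.95</b><br>Our Price: $11.96</p>"""

lines = logical_lines(html)
for line in lines:
    print(line)
```

Each table row and each BR-separated segment ends up as one logical line, which is the unit the later analysis steps work on.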

Page 33

Wrapper generator: semantic analysis of logical lines using domain knowledge

For each OBJECT in the domain knowledge, find its pattern in the logical lines and record the FORMAT that matches.
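
One way to picture this step, as a sketch only: the slide does not show how the domain knowledge is encoded, so the dictionary layout, the OBJECT names (`PRICE`, `AUTHOR`), the FORMAT names, and the regular expressions below are all hypothetical.

```python
import re

# Hypothetical domain knowledge: each OBJECT has one or more named FORMAT patterns.
DOMAIN_KNOWLEDGE = {
    "PRICE": {"dollar": r"\$\d+(?:\.\d{2})?"},
    "AUTHOR": {"by-prefix": r"\bby\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+"},
}


def annotate(lines):
    """Return (line number, OBJECT, FORMAT) for every pattern found in a logical line."""
    records = []
    for number, line in enumerate(lines):
        for obj, formats in DOMAIN_KNOWLEDGE.items():
            for fmt, pattern in formats.items():
                if re.search(pattern, line):
                    records.append((number, obj, fmt))
    return records


sample_lines = [
    "The Age of Spiritual Machines",
    "by Ray Kurzweil",
    "List Price: $14.95",
    "Our Price: $11.96",
]
records = annotate(sample_lines)
print(records)
```

The recorded (line, OBJECT, FORMAT) triples are the raw material from which extraction rules for the page can then be written out.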

Page 34

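XML rule generation

Applying the domain knowledge, describe how each target pattern appears in the HTML as an XML document and save it as an XML file.
Information is then extracted from the HTML document according to these XML-encoded extraction rules.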

Page 35

Natural Language Processing-based IE

If extracting from more natural, unstructured, human-written text, some NLP may help.

Part-of-speech (POS) tagging

Mark each word as a noun, verb, preposition, etc.

Syntactic parsing (noun phrases, verb phrases, adnominal phrases)

Identify phrases: NP, VP, PP

Semantic word categories (e.g. from WordNet)

KILL: kill, murder, assassinate, strangle, suffocate

Extraction patterns can use POS or phrase tags.

Crime victim: someone [killed whom]

Prefiller: [POS: V, Hypernym: KILL]
Filler: [Phrase: NP]
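
A toy version of this pattern, assuming the sentence has already been POS-tagged, lemmatized, and chunked. The chunk dictionaries, the sample sentence, and the function name `extract_victims` are hypothetical; in practice a tagger and a lexical resource such as WordNet would supply the annotations and the KILL class.

```python
# Semantic class from the slide; a real system would derive it from WordNet hypernyms.
SEMANTIC_CLASSES = {"KILL": {"kill", "murder", "assassinate", "strangle", "suffocate"}}


def extract_victims(chunks):
    """Apply: Prefiller [POS: V, Hypernym: KILL], Filler [Phrase: NP]."""
    victims = []
    for current, following in zip(chunks, chunks[1:]):
        is_kill_verb = (current["pos"] == "V"
                        and current["lemma"] in SEMANTIC_CLASSES["KILL"])
        if is_kill_verb and following["phrase"] == "NP":
            victims.append(following["text"])
    return victims


# Hand-annotated chunks for the hypothetical sentence "The gunman assassinated the mayor".
sentence = [
    {"text": "The gunman", "phrase": "NP", "pos": "N", "lemma": "gunman"},
    {"text": "assassinated", "phrase": "VP", "pos": "V", "lemma": "assassinate"},
    {"text": "the mayor", "phrase": "NP", "pos": "N", "lemma": "mayor"},
]

victims = extract_victims(sentence)
print(victims)
```

Because the pattern tests the semantic class rather than the surface word, the same rule covers "murdered", "strangled", and the other members of KILL.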

Page 36

Finite state automata transductions

Example phrase: John’s interesting book with a nice cover

Pattern-matching

PN ’s (ADJ)* N P Art (ADJ)* N

{PN ’s | Art}(ADJ)* N (P Art (ADJ)* N)*
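
Since the pattern is a regular expression over tags, it can be run with an ordinary regex engine by encoding the tag sequence as a string. The sketch below assumes the words are already tagged. The tag name `POS` for the possessive ’s, the function name `find_noun_groups`, and the string encoding are illustrative choices, not part of the slide.

```python
import re

# {PN 's | Art} (ADJ)* N (P Art (ADJ)* N)*  over a space-terminated tag string.
NOUN_GROUP = re.compile(r"(?<!\S)(?:PN POS |Art )(?:ADJ )*N (?:P Art (?:ADJ )*N )*")


def find_noun_groups(tagged):
    """Return the word sequences whose tag sequence matches the noun-group pattern."""
    tags = "".join(tag + " " for _, tag in tagged)
    groups = []
    for match in NOUN_GROUP.finditer(tags):
        # Each tag is followed by exactly one space, so counting spaces
        # converts character offsets back into token indexes.
        first = tags[:match.start()].count(" ")
        last = tags[:match.end()].count(" ")
        groups.append(" ".join(word for word, _ in tagged[first:last]))
    return groups


tagged = [
    ("John", "PN"), ("'s", "POS"), ("interesting", "ADJ"), ("book", "N"),
    ("with", "P"), ("a", "Art"), ("nice", "ADJ"), ("cover", "N"),
]

groups = find_noun_groups(tagged)
print(groups)
```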

Page 37

Rule-based Extraction Examples

Determining which person holds what office in what organization

[person] , [office] of [org]

Vuk Draskovic, leader of the Serbian Renewal Movement

[org] (named, appointed, etc.) [person] P [office]

NATO appointed Wesley Clark as Commander in Chief

Determining where an organization is located

[org] in [loc]

NATO headquarters in Brussels

[org] [loc] (division, branch, headquarters, etc.)

KFOR Kosovo headquarters
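
The two office-holder rules can be approximated with regular expressions, as a rough sketch. The slots are normally filled by a named-entity recognizer, so the capitalization heuristics standing in for [person], [org], and [office] are assumptions made only to keep the example self-contained, as are the names `RULES` and `extract_offices`.

```python
import re

# Crude stand-ins for entity recognition (assumptions, not real NE tagging).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"                 # two or more capitalized words
ORG = r"[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*"            # capitalized words or acronyms
TITLE = r"[A-Z][a-z]+(?: (?:in |of )?[A-Z][a-z]+)*"    # e.g. "Commander in Chief"

RULES = [
    # [person] , [office] of [org]
    re.compile(rf"(?P<person>{NAME}), (?P<office>[a-z]+) of (?:the )?(?P<org>{ORG})"),
    # [org] (named, appointed, etc.) [person] P [office]
    re.compile(rf"(?P<org>{ORG}) (?:named|appointed) (?P<person>{NAME}) as (?P<office>{TITLE})"),
]


def extract_offices(sentence):
    """Return (person, office, organization) triples matched by any rule."""
    results = []
    for rule in RULES:
        for m in rule.finditer(sentence):
            results.append((m.group("person"), m.group("office"), m.group("org")))
    return results


first = extract_offices("Vuk Draskovic, leader of the Serbian Renewal Movement")
second = extract_offices("NATO appointed Wesley Clark as Commander in Chief")
print(first)
print(second)
```

Both rules fill the same three slots, so differently worded sentences end up in a single person/office/organization table.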

Page 38

Three generations of IE systems

Hand-Built Systems – Knowledge Engineering [1980s– ]

Rules are written by hand
Requires an expert who is fluent in both the domain and the IE system
Repeats the { guess, test, revise } cycle

Automatic, Trainable Rule-Extraction Systems [1990s– ]

Systems that automatically discover rules using predefined templates
Require large labeled corpora

Statistical Generative Models [1997 – ]

Use statistical models to find the relevant parts of a document, using HMMs or statistical parsers
Learning usually supervised; may be partially unsupervised

Page 39

Evaluating IE Accuracy

Measure performance on human-built test data that was not used during system development.
Template Measure for each test document:

Total number of correct extractions in the solution template: N
Total number of slot/value pairs extracted by the system: E
Number of extracted slot/value pairs that are correct (i.e. in the solution template): C

Compute average value of metrics adapted from IR:

Recall = C/N
Precision = C/E
F-Measure = harmonic mean of recall and precision = 2 × Precision × Recall / (Precision + Recall)
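
The template measure can be sketched as follows, assuming each template is represented as a set of (slot, value) pairs. The function name `score` and the sample templates are hypothetical: they reuse the book template from the wrapper slides, with one extraction made deliberately wrong and one slot missed.

```python
def score(solution, extracted):
    """Return (recall, precision, f_measure) for one test document."""
    n = len(solution)                # N: slot/value pairs in the solution template
    e = len(extracted)               # E: slot/value pairs extracted by the system
    c = len(solution & extracted)    # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


solution = {
    ("Title", "The Age of Spiritual Machines"),
    ("Author", "Ray Kurzweil"),
    ("List-Price", "$14.95"),
    ("Price", "$11.96"),
}
extracted = {
    ("Author", "Ray Kurzweil"),
    ("List-Price", "$14.95"),
    ("Price", "$2.99"),              # wrong value: counts against precision
}

recall, precision, f_measure = score(solution, extracted)
print(f"recall={recall:.3f} precision={precision:.3f} F={f_measure:.3f}")
```

Here N = 4, E = 3, and C = 2, so recall is 2/4 and precision is 2/3; per-document scores are then averaged over the test set.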

Page 40

MUC: the genesis of IE

DARPA funded significant efforts in IE in the early to mid 1990’s.
Message Understanding Conference (MUC) was an annual event/competition where results were presented.
Focused on extracting information from news articles:

Terrorist events
Industrial joint ventures
Company management changes

Information extraction of particular interest to the intelligence community (CIA, NSA).

Reference site

http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html

Page 41

MUC Information Extraction: State of the Art c. 1997

NE – named entity recognition

CO – coreference resolution

TE – template element construction

TR – template relation construction

ST – scenario template production

Page 42

Basic IE References

Douglas E. Appelt and David Israel. 1999. Introduction to Information Extraction Technology. IJCAI 1999 Tutorial. http://www.ai.sri.com/~appelt/ie-tutorial/
Kushmerick, Weld, Doorenbos: Wrapper Induction for Information Extraction, IJCAI 1997. http://www.cs.ucd.ie/staff/nick/
Stephen Soderland: Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34(1-3): 233-272 (1999)

Dun & Bradstreet is the oldest, largest, and most established seller of business information in the world. They maintain a DB of all 11M US companies, and they do it very inefficiently: by phone calls.

We are extracting basic company identification information, like name, address, phone, fax, email from over 10M domain names.

Again, on left, original page, with markup showing where WB extracted the DB fields, which are shown on right.

Again, formatting and position on page is very indicative here. Relative position of entities says something about how they go together---which person with which title, etc.
